The objective of this workshop is to introduce statistics and data analysis using the R programming language. We will start by exploring what R is and to use it as a calculator, then we will analyse the penguins dataset.
Throughout this workshop, we will Rstudio. Rstudio is an integrated development environment, a software that includes everything we need to use R comfortably. Here’s a overview of the interface.
The Rstudio interface you have in front of you is divided in 4
parts:
Upper left: the script area. Here you can write several lines of code. To execute a bit of code written here, either select it and click on Run, or click on the line and press Ctrl + ↵
Lower left: the is the R console. Here can enter single commands. All executed commands (including those ran from the scipt area) will be sent here, and R responses will be displayed here.
Upper right: the environment panel. In this panel. This shows everything that has already been stored in the environment: data, variables, …
Lower right: file browser, plots, help on functions, …
All the necessary files are included in the folder we’ve provided to
you, called StatsWorkshop. We will work inside this folder
and define it as our working directory, meaning that everything
that we will generate and read will be from this place. Using the Files
panel in Rstudio, navigate to inside this folder. Then click on “More”
on the top of this panel, then “Set as working directory”. You will
notice that a command has appeared in the console panel: clicking on set
as working directory has instructed Rstudio to run a setwd
command. This is in fact another way to set the working directory, by
executing a command setwd("PATH/TO/THE/DIRECTORY"). If we
want a reminder of where R is currently working, we can request to know
the working directory with the command getwd().
In this module, we will see how we can use R as a (quite
powerful) calculator and learn some basics of coding in R that will be
necessary for the rest of the workshop.
\[ \\[1cm] \]
The console of R can be used as a basic calculator. For example, we can type the following commands, and R will compute and display the result.
3+2
## [1] 5
2+3*5
## [1] 17
(2+3)*5
## [1] 25
42*123
## [1] 5166
(15+3)/(143.7-17)
## [1] 0.1420679
2^11
## [1] 2048
R also knows standard function like square root
(sqrt), exponential (exp) or base 10 logarithm
(log10).
sqrt(125)
## [1] 11.18034
exp(10)
## [1] 22026.47
log10(1000)
## [1] 3
log2(8)
## [1] 3
What are the values of the following?
\[ log_{10}\left(9+{{2+3}\over{10-5}}\right) \]
\[ \sqrt{5^3 + 3 \times 7 - 2} \]
When you are ready to see the answer, click on the “Code” button on the right below this.
log10(9+(2+3)/(10-5))
## [1] 1
sqrt(5^3+3*7-2)
## [1] 12
\[ \\[1cm] \]
R can store values in variables. They are assigned using the
symbol <-.
a <- 3
b <- 5
a*b
## [1] 15
Variables can be overwritten:
a <- 42
b <- a-40
a
## [1] 42
b
## [1] 2
a <- 3
a
## [1] 3
b
## [1] 2
Note that in the above example, overwriting a did not
change the value of b: R does not go back to rerun older
lines of code!
You can name your variables with whatever name you want, with a
few exceptions that are protected names. Do not hesitate to give long,
clear names to your variables. R is a case-sensitive language, which
means that thisvariable, ThisVariable and
THISVARIABLE will all be considered to be different
names.
Importantly, variables can have different types. Such types include:
Integer, e.g. 3
Numeric, e.g. -3.14
Character, e.g. "This is a character". They
are defined between quotes, either "double" or
'simple'.
Logical, being either TRUE or FALSE. We
will cover these in more details in the next section.
There are other types, but these will be the most frequent ones. Note that not all operations are possible on all types:
character1 <- "Trying to"
character2 <- "add characters."
character1 + character2
## Error in character1 + character2: non-numeric argument to binary operator
Note: if you want to concatenate characters, the function to
do so is called paste:
character1 <- "Pasting characters"
character2 <- "works much better."
paste(character1, character2)
## [1] "Pasting characters works much better."
We will often need not only one value, but a set of values. To do
so, we can create vectors using c(). The
i-th value in the vector v can be accessed
using v[i].
v <- c(1,3,5,7,9)
v
## [1] 1 3 5 7 9
v[3]
## [1] 5
Vectors can contain any type of variable. However, they can only
contain variables that are from the same type. If you want to group
variables of different types, you need to use a structure called
lists, but we do not have the time to cover them in this
workshop.
Special case: if you want all integer values between two numbers,
there is a shortcut: start:end.
numberSequence <- 3:103
numberSequence
## [1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## [19] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## [37] 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
## [55] 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
## [73] 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92
## [91] 93 94 95 96 97 98 99 100 101 102 103
\[ \\[1cm] \]
We can create some variables that are called logical or
booleans. This means that they can take only two values:
TRUE or FALSE, sometimes coded as 1 and 0.
These names are protected in R so you cannot override them by creating a
variable with these names. There are specific operations that we can
make on logical variables:
x AND y returns TRUE if both x and y are TRUE, returns FALSE otherwise
x OR y returns TRUE if at least one of x or y is TRUE, returns FALSE if both x and y are FALSE
NOT x returns the opposite of x: FALSE if is TRUE, TRUE if x is FALSE
In R, AND is coded by &, OR is coded by
| (above the \ symbol on the keyboard) and not
is coded by !.
t <- TRUE
f <- FALSE
t & f
## [1] FALSE
t | f
## [1] TRUE
!t
## [1] FALSE
!f
## [1] TRUE
with t being defined as TRUE and f as FALSE
(as above), what do the following evaluate to?
t AND NOT f
NOT (t OR f)
((NOT t) OR f) AND t
Before you evaluate them in R, try to see if you can figure them out by yourself.
When you are ready to see the answer, click on the “Code” button on the right below this.
t <- TRUE
f <- FALSE
t & !f
## [1] TRUE
!(t | f)
## [1] FALSE
((!t) | f) & t
## [1] FALSE
We can evaluate whether two variable have the same value with
== or if they are different with !=. These
operations return logical values:
a <- 3
a == 5
## [1] FALSE
b <- 7
a != b
## [1] TRUE
The logical variables can be used to run bits of code only when a
certain condition is met. To do so, we can use the
if() else structure:
if(CONDITION){*CODE TO RUN IF THE CONDITION IS MET*} else{*CODE TO RUN IF THE CONDITION IS NOT MET*}
Note that the else{} statement is optional, and we can
remove it if we don not need anything to be done if the condition is not
met. Here is an example:
a <- 10
if(a==10){print("a is 10")} else{print("a is not 10")}
## [1] "a is 10"
a <- 5
if(a==10){print("a is 10")} else{print("a is not 10")}
## [1] "a is not 10"
a <- 3
if(a > 2){a = a-1}
a
## [1] 2
\[ \\[1cm] \]
When we need a part of the code to be run several times, there is no need to copy and paste the same chunk of code several times (imagine if you need to run it 100,000 times!). There are structures called loops. The most frequent one is the for loop, which follows this structure:
for(variable in SET OF VALUES){
CODE TO BE RUN MULTIPLE TIMES,
ONE FOR EACH VALUE IN SET OF VALUES
THIS CODE CAN USE variables
}
Here are a few examples:
for(a in c(1,3,5,7,9)){
print(a)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
## [1] 9
count <- 0
for(loop in 1:10){
count <- count +1
}
count
## [1] 10
sentence <- ""
for(word in c("One","word","at","a","time.")){
sentence <- paste(sentence,word)
}
sentence
## [1] " One word at a time."
Write a loop that prints every number form 1 to 15 but that replaces
3 and 9 by "foo" and 5 and 10 by "bar", giving
the following output:
## [1] 1
## [1] 2
## [1] "foo"
## [1] 4
## [1] "bar"
## [1] 6
## [1] 7
## [1] 8
## [1] "foo"
## [1] "bar"
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
When you are ready to see the answer, click on the “Code” button on the right below this.
for(number in 1:15){ # loop on all numbers from 1 to 15
toPrint <- number # by default, we will need to print the number we are at
if(number == 3 | number == 9){toPrint <- "foo"} # replace 3 and 9 by "foo"
if(number == 5 | number == 10){toPrint <- "bar"} # replace 5 and 10 by "bar"
print(toPrint) # print result
}
## [1] 1
## [1] 2
## [1] "foo"
## [1] 4
## [1] "bar"
## [1] 6
## [1] 7
## [1] 8
## [1] "foo"
## [1] "bar"
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
\[ \\[1cm] \]
It is possible to expand the functions already available in R or to import new datasets using packages. Packages are sets of functions and/or data written by R users that can be shared and used by anyone.
They can be distributed on different sites, with slightly different installation methods. The easiest way to know is to always visit the package’s webpage to look at the installation instruction.
For packages distributed through CRAN (the organism responsible for R
development), you can install packages using
install.packages("nameOfThePackage"). During the workshop
we will use functions from the packages tidyr,
palmerpenguins and factoMineR. To install
them, run the following:
install.packages("tidyr")
install.packages("psych")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("palmerpenguins")
install.packages("FactoMineR")
install.packages("hrbrthemes")
devtools::install_github("kassambara/factoextra")
This will install these packages on your computer. They need to
be loaded with library() to be usable in the R session:
library(tidyr)
library(palmerpenguins)
library(ggplot2)
library(FactoMineR)
library(cowplot)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
\[ \\[3cm] \]
During the rest of the workshop, we will analyze an example
dataset, that lists measurements of penguins observed in the Palmer
Archipelago (Antarctica) by Dr Kristen Gorman (University of Alaska).
For each animal included in the dataset, we have access to:
Its species
The island it was observed on
The length and depth of it bill (see illustration below)
Its flipper length
Its body mass
Its sex
The year it was observed
\[ \\[1cm] \]
The first step to access and analyse a dataset is to load it into the
R environment. We provided you with the palmerpenguins dataset as a
.csv file. This is a text format to represent a table where
cells are separated by a comma. To load data from a .csv file into the R
environment, we will use the read.table() function:
penguins <- read.table("PATH/penguins.csv",sep = ",", header = T)
getwd
penguins =
You can also choose the file interactively and then read the file on the saved file name. This is how you are going to do this: 1. f <- file.choose() 2. penguins <- read.csv(f)
Don’t forget to adjust the path to the file to allow R to find it.
the sep argument specifies what was used to separate the
table cells. The header parameter tells R that the first
line contains headers, i.e. column names.
We stored it in a variable called penguins. Its structure is different from other variables: this a table, called a data frame. Each individual penguin is on a different line, and each column is a different information on all penguins.
To check everything went well, we can have a look at the first few lines of the data frame:
head(penguins)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18.0 195 3250
## 4 Adelie Torgersen NA NA NA NA
## 5 Adelie Torgersen 36.7 19.3 193 3450
## 6 Adelie Torgersen 39.3 20.6 190 3650
## sex year
## 1 male 2007
## 2 female 2007
## 3 female 2007
## 4 <NA> 2007
## 5 female 2007
## 6 male 2007
This prints the first few lines. You will notice that some data are
labeled as NA. This means that this information is missing, something
that happens more often than you would think in real life datasets! R
encodes this with a special value NA.
\[ \\[1cm] \]
To have a quick look at what is in the dataset, we can use the
summary() function:
summary(penguins)
## species island bill_length_mm bill_depth_mm
## Length:344 Length:344 Min. :32.10 Min. :13.10
## Class :character Class :character 1st Qu.:39.23 1st Qu.:15.60
## Mode :character Mode :character Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2700 Length:344 Min. :2007
## 1st Qu.:190.0 1st Qu.:3550 Class :character 1st Qu.:2007
## Median :197.0 Median :4050 Mode :character Median :2008
## Mean :200.9 Mean :4202 Mean :2008
## 3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## NA's :2 NA's :2
Descriptive statistics may be a long gone memory, so here’s a
reminder:
Median is the value that splits the set in 2 halves of the same size. For example, the median body weight of our penguins is 4202g. This means that 50% of penguins have a body mass below or equal to 4202g, and 50% have a body mass above or equal to 4202g.
1st and 3rd quartiles are the values that split the set at 25% and 75%, respectively.
To have a more complete summary of the data, we can also use the
function describe() from the package
psych:
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(penguins)
## vars n mean sd median trimmed mad min max
## species* 1 344 1.92 0.89 2.00 1.90 1.48 1.0 3.0
## island* 2 344 1.66 0.73 2.00 1.58 1.48 1.0 3.0
## bill_length_mm 3 342 43.92 5.46 44.45 43.91 7.04 32.1 59.6
## bill_depth_mm 4 342 17.15 1.97 17.30 17.17 2.22 13.1 21.5
## flipper_length_mm 5 342 200.92 14.06 197.00 200.34 16.31 172.0 231.0
## body_mass_g 6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
## sex* 7 333 1.50 0.50 2.00 1.51 0.00 1.0 2.0
## year 8 344 2008.03 0.82 2008.00 2008.04 1.48 2007.0 2009.0
## range skew kurtosis se
## species* 2.0 0.16 -1.73 0.05
## island* 2.0 0.61 -0.91 0.04
## bill_length_mm 27.5 0.05 -0.89 0.30
## bill_depth_mm 8.4 -0.14 -0.92 0.11
## flipper_length_mm 59.0 0.34 -1.00 0.76
## body_mass_g 3600.0 0.47 -0.74 43.36
## sex* 1.0 -0.02 -2.01 0.03
## year 2.0 -0.05 -1.51 0.04
To access one of the columns of a data frame, we can use the
symbol $, with this syntax:
dataframe$columnname
penguins$species
## [1] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [7] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [13] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [19] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [25] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [31] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [37] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [43] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [49] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [55] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [61] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [67] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [73] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [79] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [85] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [91] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [97] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [103] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [109] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [115] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [121] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [127] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [133] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [139] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [145] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [151] "Adelie" "Adelie" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [157] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [163] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [169] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [175] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [181] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [187] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [193] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [199] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [205] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [211] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [217] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [223] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [229] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [235] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [241] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [247] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [253] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [259] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [265] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [271] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [277] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [283] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [289] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [295] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [301] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [307] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [313] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [319] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [325] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [331] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [337] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [343] "Chinstrap" "Chinstrap"
penguins$bill_length_mm
## [1] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
## [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5
## [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0
## [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6
## [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5
## [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9
## [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0 37.7 37.8 37.9
## [106] 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1
## [121] 36.2 37.7 40.2 41.4 35.2 40.6 38.8 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1
## [136] 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8
## [151] 36.0 41.5 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5
## [166] 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9 46.1 44.5 47.8
## [181] 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3
## [196] 49.6 50.5 43.6 45.5 50.5 44.9 45.2 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5
## [211] 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2
## [226] 46.5 46.4 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3
## [241] 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1 48.5 55.9 47.2
## [256] 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8
## [271] 47.2 NA 46.8 50.4 45.2 49.9 46.5 50.0 51.3 45.4 52.7 45.2 46.1 51.3 46.0
## [286] 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6
## [301] 46.7 52.0 50.5 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9
## [316] 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4 45.7 50.7
## [331] 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2
If we want to count data, we can use the function
table(). We can ask it for counts of one or two variables.
The useNA parameter is used to indicate R whether to
display the number of NAs. It must be one of "no",
"ifany" or "always", to respectively not show
them, show them only if there are NAs, or to always show them, even if
there are none. If not specified, it will default to “no”. Optionnaly,
we can also name the variable we want to count.
table(penguins$sex, useNA = "ifany")
##
## female male <NA>
## 165 168 11
table(species = penguins$species, island = penguins$island)
## island
## species Biscoe Dream Torgersen
## Adelie 44 56 52
## Chinstrap 0 68 0
## Gentoo 124 0 0
Using the table() function, find the answer to the following:
How many penguins were observed on Dream Island?
How many females were observed in 2008?
How many missing entries (NAs) are there in the island column?
When you are ready to see the answer, click on the “Code” button on the right below this.
table(penguins$island)
##
## Biscoe Dream Torgersen
## 168 124 52
# 124 penguins observed on Dream island
table(sex = penguins$sex, year = penguins$year)
## year
## sex 2007 2008 2009
## female 51 56 58
## male 52 57 59
# 56 females observed in 2008
table(penguins$island,useNA = "always")
##
## Biscoe Dream Torgersen <NA>
## 168 124 52 0
# 0 NAs for the island column
\[ \\[1cm] \]
Perhaps we don’t want to look at all available penguins but maybe we are only interested in Adelie penguins. For this and many different analyses, we will need to only look at a subet of the dataset. There are several ways to do so. The first one is using the subset() function:
adelie <- subset(penguins, species=="Adelie")
table(species = adelie$species, island = adelie$island)
## island
## species Biscoe Dream Torgersen
## Adelie 44 56 52
If we want to specify several possible value, can use the constructor
%in%:
biscoeDream <- subset(penguins, island %in% c("Biscoe","Dream"))
table(species = adelie$species, island = adelie$island)
## island
## species Biscoe Dream Torgersen
## Adelie 44 56 52
We can also put several conditions:
malesTorgersen <- subset(penguins, island == "Torgersen" & sex == "Male")
table(sex = adelie$sex, island = adelie$island)
## island
## sex Biscoe Dream Torgersen
## female 22 27 24
## male 22 28 23
Another way to subset the data is to use the
filter() function from the package dplyr. It
works similarly to subset().
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
adelieFemales <- filter(penguins, species == "Adelie", sex == "female")
We can also specify directly which lines and/or columns to keep. To
do so, we use the structure dataframe[lines, columns]. If
either lines or columns are left empty, this means that we want to keep
all lines or columns, respectively. We can specify which lines / columns
to keep either by giving a vector of logicals, numbers, or names:
subset <- penguins[penguins$island == "Biscoe", c("species","island","body_mass_g")] #keep only lines that have "Biscoe" as island, and keep only columns "species", "island" and "body_mass_g"
subset <- penguins[25:40,] # keep only lines 25 to 40, and all columns
Another way to subset the data is to use the filter()
function from the package dplyr. It works similarly to
subset().
library(dplyr)
adelieFemales <- filter(penguins, species == "Adelie", sex == "female")
Using table() or summary() on a subset of
the penguins data frame, find out the answer to the following
questions:
What is the mean body weight of female Gentoo penguins?
What is the smallest observed flipper length for Chinstrap penguins in 2007?
How many Adelie penguins weighing less than 4,000 g were observed on Dream Island?
When you are ready to see the answer, click on the “Code” button on the right below this.
summary(subset(penguins,sex=="female" & species == "Gentoo"))
## species island bill_length_mm bill_depth_mm
## Length:58 Length:58 Min. :40.90 Min. :13.10
## Class :character Class :character 1st Qu.:43.85 1st Qu.:13.80
## Mode :character Mode :character Median :45.50 Median :14.25
## Mean :45.56 Mean :14.24
## 3rd Qu.:46.88 3rd Qu.:14.60
## Max. :50.50 Max. :15.50
## flipper_length_mm body_mass_g sex year
## Min. :203.0 Min. :3950 Length:58 Min. :2007
## 1st Qu.:210.0 1st Qu.:4462 Class :character 1st Qu.:2007
## Median :212.0 Median :4700 Mode :character Median :2008
## Mean :212.7 Mean :4680 Mean :2008
## 3rd Qu.:215.0 3rd Qu.:4875 3rd Qu.:2009
## Max. :222.0 Max. :5200 Max. :2009
# Mean body mass: 4680g
summary(subset(penguins,year==2007 & species == "Chinstrap"))
## species island bill_length_mm bill_depth_mm
## Length:26 Length:26 Min. :42.40 Min. :16.60
## Class :character Class :character 1st Qu.:46.17 1st Qu.:17.80
## Mode :character Mode :character Median :48.85 Median :18.20
## Mean :48.72 Mean :18.48
## 3rd Qu.:51.30 3rd Qu.:19.35
## Max. :58.00 Max. :20.30
## flipper_length_mm body_mass_g sex year
## Min. :178.0 Min. :2900 Length:26 Min. :2007
## 1st Qu.:190.0 1st Qu.:3506 Class :character 1st Qu.:2007
## Median :193.5 Median :3700 Mode :character Median :2007
## Mean :192.4 Mean :3694 Mean :2007
## 3rd Qu.:197.0 3rd Qu.:3875 3rd Qu.:2007
## Max. :201.0 Max. :4400 Max. :2007
# Minimum flipper length: 178.8mm
table(penguins[penguins$species=="Adelie" & penguins$body_mass_g < 4000, "island"])
##
## Biscoe Dream Torgersen
## 33 42 37
# 42 Adelie penguins weighing less than 4,000g on Dream island
\[ \\[1cm] \]
The summary() function allows to have a quick look at
what’s in our dataset, but returns many different output. We can instead
directly compute descriptive statistics using the following
functions:
mean() for the mean
median() for the median
quantile() for any given quantile (generalization of
the quartile concept to other fractions than quarters). We indicate the
quantile we want with the parameter probs, expressed
between 0 and 1. Note that the median is equivalent to
quantile(x, probs = 0.5).
sd() for standard deviation, a measure of data
variability
By default, all these functions will return NA if the
data contains NAs. To ignore NAs and return descriptive statistics of
the non-missing data, we can use the parameter
na.rm = TRUE.
mean(penguins$body_mass_g)
## [1] NA
mean(penguins$body_mass_g, na.rm = TRUE)
## [1] 4201.754
median(penguins$flipper_length_mm, na.rm = TRUE)
## [1] 197
quantile(penguins$flipper_length_mm, probs = c(0.10,0.25,0.5,0.75,0.90), na.rm = T)
## 10% 25% 50% 75% 90%
## 185.0 190.0 197.0 213.0 220.9
sd(penguins$bill_length_mm, na.rm = TRUE)
## [1] 5.459584
Before the next exercise, let’s add a new column to the dataset that will contain the bill length to depth ratio:
penguins$bill_ratio <- penguins$bill_length_mm / penguins$bill_depth_mm
What is the mean and standard deviation of the body mass of Adelie penguins?
What is the median flipper length of female penguin?
What are the quartiles and median bill length to depth ratios for Chinstrap penguins?
When you are ready to see the answer, click on the “Code” button on the right below this.
mean(penguins[penguins$species == "Adelie","body_mass_g"], na.rm = TRUE)
## [1] 3700.662
sd(penguins[penguins$species == "Adelie","body_mass_g"], na.rm = TRUE)
## [1] 458.5661
# Mean: 3700.662, sd: 458.5661
median(penguins[penguins$sex == "female","flipper_length_mm"], na.rm = TRUE)
## [1] 193
# Median: 193
quantile(penguins[penguins$species == "Chinstrap","bill_ratio"], probs = c(0.25,0.5,0.75), na.rm = TRUE)
## 25% 50% 75%
## 2.559776 2.661577 2.728167
# 1st quartile 2.556, median 2.662, 3rd quartile 2.728
\[ \\[3cm] \]
The summary() function allows to have a quick look at
what’s in our dataset, but returns many different output. We can instead
directly compute descriptive statistics using the following
functions:
mean() for the mean
median() for the median
quantile() for any given quantile (generalization of
the quartile concept to other fractions than quarters). We indicate the
quantile we want with the parameter probs, expressed
between 0 and 1. Note that the median is equivalent to
quantile(x, probs = 0.5).
sd() for standard deviation, a measure of data
variability
By default, all these functions will return NA if the
data contains NAs. To ignore NAs and return descriptive statistics of
the non-missing data, we can use the parameter
na.rm = TRUE.
mean(penguins$body_mass_g)
## [1] NA
mean(penguins$body_mass_g, na.rm = TRUE)
## [1] 4201.754
median(penguins$flipper_length_mm, na.rm = TRUE)
## [1] 197
quantile(penguins$flipper_length_mm, probs = c(0.10,0.25,0.5,0.75,0.90), na.rm = T)
## 10% 25% 50% 75% 90%
## 185.0 190.0 197.0 213.0 220.9
sd(penguins$bill_length_mm, na.rm = TRUE)
## [1] 5.459584
Before the next exercise, let’s add a new column to the dataset that will contain the bill length to depth ratio:
penguins$bill_ratio <- penguins$bill_length_mm / penguins$bill_depth_mm
What is the mean and standard deviation of the body mass of Adelie penguins?
What is the median flipper length of female penguin?
What are the quartiles and median bill length to depth ratios for Chinstrap penguins?
When you are ready to see the answer, click on the “Code” button on the right below this.
mean(penguins[penguins$species == "Adelie","body_mass_g"], na.rm = TRUE)
## [1] 3700.662
sd(penguins[penguins$species == "Adelie","body_mass_g"], na.rm = TRUE)
## [1] 458.5661
# Mean: 3700.662, sd: 458.5661
median(penguins[penguins$sex == "female","flipper_length_mm"], na.rm = TRUE)
## [1] 193
# Median: 193
quantile(penguins[penguins$species == "Chinstrap","bill_ratio"], probs = c(0.25,0.5,0.75), na.rm = TRUE)
## 25% 50% 75%
## 2.559776 2.661577 2.728167
# 1st quartile 2.556, median 2.662, 3rd quartile 2.728
\[ \\[3cm] \]
To have a better sense of any dataset, plotting some results can be very useful. In this module, we’ll learn how to draw simple graphics using the base R functions. Those are rather rudimentary but are easier to use and still allow some flexibility. If you want to create more refined graphics using R, we encourage you to have a look at what ggplot2 can offer! We’ll give a ggplot2 example for each plot type if you want to learn more.
One of the first graphs we can have a look at are histograms. These
allow to have a rapid glance at the distribution of our data. We can
draw them using the hist function. We can control the
number of bars with breaks, which either takes the numeric
values at which to break, or a target number of bars.
hist(penguins$bill_length_mm)
hist(penguins[penguins$species == "Gentoo", "body_mass_g"], breaks = 10)
We’re probably thinking that these plots are fairly ugly. And you’d
be right. Let’s improve them. First, we can modify x axis label using
the xlab parameter and a title using main.
hist(penguins[penguins$species == "Gentoo", "body_mass_g"], breaks = 10, xlab = "Body mass (g)", main = "Body mass distribution of Gentoo penguins")
Then we can change the grey color to anything we’d like. This is
controlled with the col parameter. We can either set a
frequent color name such as "red" or "blue",
or input the hexadecimal code of the color after #:
hist(penguins[penguins$species == "Gentoo", "body_mass_g"], breaks = 10, xlab = "Body mass (g)", main = "Body mass distribution of Gentoo penguins", col = "#057803")
Draw a histogram of the flipper length of female penguins, with a proper x axis label, a title and color it anything else than grey.
When you are ready to see the answer, click on the “Code” button on the right below this.
If you want an example of what ggplot2 can offer instead of the base R functions:
library(ggplot2)
ggplot(penguins, mapping = aes(x = body_mass_g, fill = species)) + # draw a ggplot2 figure on the penguins data frame, where the x axis is the body mass, and the colours depend on the species
geom_histogram(alpha = 0.7, position = "identity", bins = 20) + # draw a histogram, with 30% transparence and overlap them
labs(title = "Penguins body mass distribution", subtitle = "By species", x = "Body mass (g)", y = "Count") + # Set title and labels
scale_fill_manual(values = c("#ff1111","#bf12cb","#067275")) + # specify the colour scheme
theme_bw() # set the theme
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_bin()`).
\[ \\[1cm] \]
Another simple graph type are scatterplots. These are points drawn from their (x,y) coordinates. They can help to see if two numerical variables are associated. Here are some examples:
plot(x = penguins$bill_length_mm, y = penguins$bill_depth_mm)
plot(x = penguins$flipper_length_mm, y = penguins$body_mass_g)
These plots are again not very eye-pleasing. Let’s tweak them.
Similarly as above, we can add labels to each axis. Let’s put dots
instead of circles with the parameter pch (you can find the whole list
of possible values here).Let’s
also color the points depending on each penguin’s species. To do so, we
will use the parameter color. We need to send him a vector
of colors, with one color for each point. The easiest way to do so is
the following:
col = c("Adelie" = "#ff7900", "Chinstrap" = "#bf5ccb", "Gentoo" = "#067275")[penguins$species].
Finally, using the function legend(), we can add a legend
to our plot.
plot(x = penguins$flipper_length_mm, y = penguins$body_mass_g, pch = 16, xlab = "Flipper length (mm)", ylab = "Body mass (g)", col = c("Adelie" = "#ff7900", "Chinstrap" = "#bf5ccb", "Gentoo" = "#067275")[penguins$species])
legend("topleft",legend = c("Adelie","Chinstrap","Gentoo"),col = c("#ff7900","#bf5ccb","#067275"), pch = 16)
Draw a scatterplot representing the body mass of penguins on the x axis and their bill length on the y axis. Color it depending on the islands, and put different symbols for male and female penguins. Add proper labels, a legend for the islands on the bottom right of the plot and a legend for the sex on the top left.
When you are ready to see the answer, click on the “Code” button on the right below this.
\[ \\[1cm] \]
To represent the different possible values taken by a variable, we can draw boxplots. Here is a boxplot example for the flipper length of our penguins:
boxplot(penguins$flipper_length_mm)
The large horizontal bar represents the median. The box itself extends from the first to the third quartile. The whiskers try to extend to the min/max but will not extend for more than 1.5 times the interquartile range (the distance between 1st and 3rd quartiles). If some data points are outside of this range (outliers), they are drawn individually.
We can specify that we want different boxes for different
modalities associated with a given categorical variable using the symbol
~:
boxplot(penguins$body_mass_g ~ penguins$species)
boxplot(penguins$flipper_length_mm ~ penguins$island)
Once again, we can make the graphs nicer using parameters such as
col, xlab, ylab or
main. We can also use ylim to manually specify
the limits of the y axis:
boxplot(penguins$body_mass_g ~ penguins$island, col = c("darkgreen","darkblue","darkred"), xlab = "Island", ylab = "Body mass (g)", main = "Body mass by island", ylim = c(0,7000))
We can also use the geom_jitter() argument from the ggplot2 package, add individual observation on top of boxes. This function shifts all dots by a random value ranging from 0 to size, avoiding overlaps.
library(hrbrthemes)
ggplot(penguins, aes(x=island, y=body_mass_g, fill=island)) +
geom_boxplot() +
scale_fill_manual(values = c("#ff7900","#bf5ccb","lightblue")) +
geom_jitter(color= "black", size=1, alpha=0.8) +
theme_bw() +
labs(title = "Body mass", subtitle = "by island", , y = "Body mass (g)", tag = "Figure 1a") +
theme(plot.title = element_text(color="darkblue", size=25, face="bold.italic"),
axis.title.y = element_text(color="darkred", size=14, face="bold"))
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
Draw a boxplot showing the bill length by sex. Add some labels and colors, and start the y axis at 0.
When you are ready to see the answer, click on the “Code” button on the right below this.
\[
\\[1cm]
\]
Heatmaps are graphical representations of data, which use a
color-code to represent their different values. Colour variation is
analogous to the intensity of the data values,a nd they are used to
visually inform the reader about a phenomenon.
z-standarization is
commonly used in heatmap generation.
What is data z-standarization?
It is a statistical
procedure used to make data points from different datasets/samples
comparable. In this procedure, each data point is converted into a
z-score. A z-score indicates how many standard deviations a data point
is from the mean of the dataset.
How do we calculate the z-score?
z-score calculation formula.
z-value is the value we want to calculate.
x is the observed
value.
μ is the mean value of the samples (per row).
σ is the
standard deviation of the sample.
Be cautious, as μ and σ are
actually the mean and the standard deviation of the population, but in
our case we have a limited number of samples. Please search online how
somebody can estimate , under certain conditions, the mean and the
standard deviation using a sample.
Here, you can see 2 ways to vizualize transcriptomic data, using
standarization. Initially, we use the heatmap function from base R.
Additionally, we use the pheatmap package, which is commonly used
between researchers when they publish their work.
A 3rd package,
which is used for more complicated heatmaps is the following :
ComplexHeatmap
library(pheatmap)
library(RColorBrewer)
expr.data <- read.table("RNAseqData.csv", sep = ",", header = T)
# expr.data <- read.table("<PATH_to_RNAseqData/RNAseqData.csv", sep = ",", header = T)
rownames(expr.data) = expr.data[,1]
# This is a comment
# expr.data =as.matrix(expr.data[,-1])
expr.data = as.matrix(expr.data[,2:ncol(expr.data)])
heatmap(expr.data, scale = "none")
heatmap(expr.data, scale = "row")
pheatmap(expr.data, fontsize_row = 5)
pheatmap(expr.data, color = colorRampPalette(rev(brewer.pal(n=10, name = "RdBu"))) (100),
border_color = "grey90",
scale = "row",
cellwidth = 20,
cellheight = 5,
cluster_rows = TRUE,
cluster_columns = TRUE,
show_colnames = TRUE,
show_rownames = TRUE,
fontsize_row = 5,
fontsize_col = 10,
fontsize = 10,
na_col = "black",
legend = TRUE)
\[ \\[1cm] \]
Volcano plots are commonly used graphs to show the results of omics
experiments. They are used regularly in RNAseq experiments to diplay
Differentially Expressed Genes (DEGs), as somebody can easily identify
biologically significantly changing genes.
They are scatterplots
that show statistical significance (p-val) vs magnitude of change (FC -
Fold Change). Usually, The up-regulated genes are towards the right,
while the down-regulated towards the left.
They are scatterplots, widely used in GWAS (Genome Wide Association Study). Every point is associated with a genetic variant. The x-axis shows the position of the variant on a chromosome, while the y-axis displays the level of its association with a trait.
Venn diagrams are widely used graphs that show the logical relation between sets. Became popular by John Venn (1834-1923). A Venn diagram shows all the possible logical relations between a finite collection of different sets. It is made of two or more circles that sometimes overlap.
\[
\\[1cm]
\]
Ref: https://infogram.com/blog/choose-the-right-chart/
\[ \\[1cm] \]
We will focus on our penguins data creating subsets of them to see
their distribution. We will step-by-step check the “rules” that need to
be followed in order to perform different types of tests to check if
they are “equal” or not. When we test for “equality” we usually check
differences in the mean values of different groups in our dataset. Let’s
start by trying to collect the data that will be used in our analysis,
by focusing on the penguins from the “Adelie” and “Gentoo” species. We
start by subsetting the dataset using filter. You will note
in this example a different structure: |> is called a
pipe. It tells R to send the object specified before
|> as the main argument of the following function. It
can be used to make the code clearer to read.
library(dplyr)
data_to_test <- penguins
data_to_test <- data_to_test |> filter(species %in% c("Adelie","Gentoo"), !is.na(body_mass_g))
Let’s have a quick look of our data using the summary()
and describe() functions. What is the output of these
functions? What does it mean?
library(psych)
summary(data_to_test)
## species island bill_length_mm bill_depth_mm
## Length:274 Length:274 Min. :32.10 Min. :13.10
## Class :character Class :character 1st Qu.:38.35 1st Qu.:15.00
## Mode :character Mode :character Median :42.00 Median :17.00
## Mean :42.70 Mean :16.84
## 3rd Qu.:46.67 3rd Qu.:18.50
## Max. :59.60 Max. :21.50
## flipper_length_mm body_mass_g sex year
## Min. :172.0 Min. :2850 Length:274 Min. :2007
## 1st Qu.:190.0 1st Qu.:3600 Class :character 1st Qu.:2007
## Median :198.0 Median :4262 Mode :character Median :2008
## Mean :202.2 Mean :4318 Mean :2008
## 3rd Qu.:215.0 3rd Qu.:4950 3rd Qu.:2009
## Max. :231.0 Max. :6300 Max. :2009
## bill_ratio
## Min. :1.640
## 1st Qu.:2.106
## Median :2.323
## Mean :2.594
## 3rd Qu.:3.137
## Max. :3.613
describe(data_to_test)
## vars n mean sd median trimmed mad min
## species* 1 274 1.45 0.50 1.00 1.44 0.00 1.00
## island* 2 274 1.58 0.79 1.00 1.47 0.00 1.00
## bill_length_mm 3 274 42.70 5.20 42.00 42.54 6.23 32.10
## bill_depth_mm 4 274 16.84 2.01 17.00 16.80 2.45 13.10
## flipper_length_mm 5 274 202.18 15.05 198.00 201.84 17.79 172.00
## body_mass_g 6 274 4318.07 835.93 4262.50 4293.98 1000.76 2850.00
## sex* 7 265 1.51 0.50 2.00 1.51 0.00 1.00
## year 8 274 2008.04 0.81 2008.00 2008.05 1.48 2007.00
## bill_ratio 9 274 2.59 0.55 2.32 2.58 0.62 1.64
## max range skew kurtosis se
## species* 2.00 1.00 0.20 -1.97 0.03
## island* 3.00 2.00 0.89 -0.81 0.05
## bill_length_mm 59.60 27.50 0.30 -0.68 0.31
## bill_depth_mm 21.50 8.40 0.09 -0.97 0.12
## flipper_length_mm 231.00 59.00 0.16 -1.27 0.91
## body_mass_g 6300.00 3450.00 0.22 -1.00 50.50
## sex* 2.00 1.00 -0.02 -2.01 0.03
## year 2009.00 2.00 -0.08 -1.46 0.05
## bill_ratio 3.61 1.97 0.20 -1.64 0.03
\[ \\[1cm] \]
How can we understand that our data follow the normal
distribution as seen above?
Firstly, we can plot a histogram of our
data to check how they look. In R, we do the following to draw the
histograms of the body mass by species using ggplot2:
ggplot(aes(x=body_mass_g), data = data_to_test) +
geom_histogram(binwidth = 200) +
facet_wrap(~species) +
theme_bw()
It seems that our data, after choosing a right bin size, follow
the normal distribution, without major outliers.
Secondly, we need to check the Q-Q (quantile-quantile) plots. In normally distributed data, these plots should fall along the 1:1 line. The most simple way to do it is the following
qqnorm(data_to_test$body_mass_g)
qqline(data_to_test$body_mass_g)
To have a more clear picture we can vizualize two plots for each
species.
data_to_test |>
ggplot(aes(sample = body_mass_g )) +
geom_qq() +
geom_qq_line(colour ='darkred') +
facet_wrap(~species) +
theme_bw()
Last but not least, we need to check that the variances of the
two groups are equal. If, also the variances are equal, then we can
assume that the data follow the normal distribution, and we can perform
a parametric test to check for their differences. To do that we will
perform the Levene’s test for equality of variances. We will use the
levene_test() function from the rstatix
package.
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
# check for equality of variance with Levene's test
data_to_test |> levene_test(body_mass_g ~ species)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## # A tibble: 1 × 4
## df1 df2 statistic p
## <int> <int> <dbl> <dbl>
## 1 1 272 1.99 0.159
What is the p-value from the test? Is it less than 0.05? If the p-value is > 0.05, it means that the variances are not significantly different.
\[ \\[1cm] \]
When we want to compare the difference between two means, we conduct a t-test. In a t-test, we test the initial hypothesis (H0), that the means are equal. Practically, after conducting the t-test, we check the p-value, and if it is < 0.05, we reject the null hypothesis, accepting an alternative hypothesis, that the means are significantly different.
There are several types of t-test, which include:
One-sample t-test. Here, we compare the mean of a sample to a known value. In our example, we could compare the body mass of our lovely penguins to the literature’s standard.
Unpaired two-sample t-test Here, we compare the mean of two independent groups. In our example, we could compare the mean value of body mass of the male penguins to the female ones. Alternatively, the body mass of the body mass of the Adlie penguins can be compared to the Gentoo penguins.
Paired t-test This test is used to compare the mean values of two RELATED groups of samples. For example, it could be used when we compare the average weight of a sample (weight of polar bears) before and after winter.
To compare the difference of means in more than 2 samples (3 and more), the parametric test used is the One way Analysis of Variance, but it will not be covered in this workshop.
In our example, we can conclude that the data follow the normal
distribution, therefore we will perform a parametric test to compare the
mean value of the body mass in our two groups, Adelie and Gentoo
penguins.
R’s t.test() function perform’s a Welch’s t-test by
default. A Welch’s t-test does not assume equal variance. It should be
used if there is any doubt about equal variance.
In R to perform a
two-sample t-test, we need to add the option var.equal =TRUE
In base
R we do the following:
t.test(data_to_test$body_mass_g ~ data_to_test$species)
##
## Welch Two Sample t-test
##
## data: data_to_test$body_mass_g by data_to_test$species
## t = -23.386, df = 249.64, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Adelie and group Gentoo is not equal to 0
## 95 percent confidence interval:
## -1491.183 -1259.525
## sample estimates:
## mean in group Adelie mean in group Gentoo
## 3700.662 5076.016
t.test(data_to_test$body_mass_g ~ data_to_test$species, var.equal = TRUE)
##
## Two Sample t-test
##
## data: data_to_test$body_mass_g by data_to_test$species
## t = -23.614, df = 272, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Adelie and group Gentoo is not equal to 0
## 95 percent confidence interval:
## -1490.021 -1260.687
## sample estimates:
## mean in group Adelie mean in group Gentoo
## 3700.662 5076.016
Using the dplyr package: Theory behind using t_test() function
One-sample test ::::::::::::::::::::::::::::::::::::::::: df %>%
t_test(df$columnx ~ 1, mu = 0)
Two-samples unpaired test ::::::::::::::::::::::::::::::::::::::::: df %>% t_test(df\(columnx ~ df\)columny)
Two-samples paired test ::::::::::::::::::::::::::::::::::::::::: df %>% t_test (df\(columnx ~ df\)columny, paired = TRUE)
library(dplyr)
data_to_test |>
t_test(body_mass_g ~ species)
## # A tibble: 1 × 8
## .y. group1 group2 n1 n2 statistic df p
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
## 1 body_mass_g Adelie Gentoo 151 123 -23.4 250. 7.71e-65
What do we conclude? Is the body mass of Adelie and Gentoo
penguins significantly different?
Let’s create a nice visualization that includes the graphs we created when we checked for normality, together with boxplots that show the body mass of our two groups.
p1 <- ggplot(aes(x=body_mass_g), data = data_to_test) +
geom_histogram(binwidth = 200) +
facet_wrap(~species) +
theme_bw()
p2 <- data_to_test |>
ggplot(aes(sample = body_mass_g )) +
geom_qq() +
geom_qq_line(colour ='darkred') +
facet_wrap(~species) +
theme_bw()
p3 <- data_to_test |>
ggplot(aes(x = species,
y = body_mass_g)) +
geom_boxplot(aes(fill = species)) +
scale_fill_manual(values = c("darkorange","cyan4")) +
geom_jitter( alpha =0.5) +
theme_bw()
library(patchwork)
##
## Attaching package: 'patchwork'
## The following object is masked from 'package:cowplot':
##
## align_plots
p1/ p2 | p3
The previous was an example of an unpaired two-sample t-test.
But let’s also perform a one sample t-test where we will compare
the body mass of penguins, regardless the island the come from, to the
literature’s standard body mass of penguins in Antarctica, which is
3700kg.
t.test(penguins$body_mass_g, mu = 3700, alternative = "two.sided")
##
## One Sample t-test
##
## data: penguins$body_mass_g
## t = 11.571, df = 341, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 3700
## 95 percent confidence interval:
## 4116.458 4287.050
## sample estimates:
## mean of x
## 4201.754
Does the body mass of our sample’s data significantly differ from the literature’s standard value? If, instead of 3700, we compare the body mass of the penguins’ dataset to 4200, what would be your conclusion? Would it be the same? \[ \\[1cm] \]
When our data are not “normal” (normally distributed), instead of a parametric test we could use a non-parametric alternative. More specifically, a non-parametric test does not assume anything about the underlying distribution of our data.
In our example, in case we haven’t performed a ‘normality’ check, we can perform a Mann-Whitney U test, which is the non-parametric alternative to the unpaired t-test.
#performs a Mann-Whitney-Wilcoxon
wilcox.test(data_to_test$body_mass_g ~ data_to_test$species, paired = FALSE)
##
## Wilcoxon rank sum test with continuity correction
##
## data: data_to_test$body_mass_g by data_to_test$species
## W = 400.5, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
# to perform an alternative to a paired t-test, use the Wilcoxon signed-rank test, also known as Wilcoxon-Whitney-Wilcoxon test, where option paired = TRUE
The Mann–Whitney U test / Wilcoxon rank-sum test is not the same
as the Wilcoxon signed-rank test, although both are nonparametric and
involve summation of ranks. The Mann–Whitney U test is applied to
independent samples. The Wilcoxon signed-rank test is applied to matched
or dependent samples. \[
\\[1cm]
\]
Check if the the flipper length of female penguins is different from the male ones. Don’t forget to first check for normality!
When you are ready to see the answer, click on the “Code” button on the right below this.
What happens when we are comparing between more than 2 groups? For parametric testing, we would perform one-way analysis of variance (ANOVA), which we will not cover during this workshop. For non-parametric test, we will use the Kruskal-Wallis test. For instance, let’s see if the bill length is significantly different between islands:
kruskal.test(penguins$bill_length_mm ~ penguins$island)
##
## Kruskal-Wallis rank sum test
##
## data: penguins$bill_length_mm by penguins$island
## Kruskal-Wallis chi-squared = 52.438, df = 2, p-value = 4.103e-12
Since the p-value is < 0.05, we can conclude that the bill
length of penguins is significantly different depending on the islands.
We can visualise the data as well:
penguins |>
ggplot(aes(x = island,
y = bill_length_mm)) +
geom_boxplot(aes(fill = island)) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Bill length by island", x = "Island", y = "Bill length (mm)") +
theme_bw()
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
Using a non-parametric test, determine if the flipper length is
significantly between the three species. Either using the base
boxplot function or using ggplot2, draw a
boxplot to visualise the data.
When you are ready to see the answer, click on the “Code” button on the right below this.
kruskal.test(penguins$flipper_length_mm ~ penguins$species)
##
## Kruskal-Wallis rank sum test
##
## data: penguins$flipper_length_mm by penguins$species
## Kruskal-Wallis chi-squared = 244.89, df = 2, p-value < 2.2e-16
boxplot(penguins$flipper_length_mm ~ penguins$species, xlab = "Species", ylab = "Flipper length (mm)", main = "Flipper length by species", col = c("darkorange","purple","cyan4"))
\[ \\[1cm] \]
The following table will help you to decide which statistical test to use based on your data:
| n of Groups to compare | Parametric Test | Non-parametric test |
|---|---|---|
| Two (unpaired) | Un-paired t-test | Mann-Whitney U test, also known as Mann-Whitney-Wilcoxon |
| Two (paired) | Paired t-test | Wilcoxon signed-rank test, also known asWilcoxon-Whitney-wilcoxon |
| Three or more | One way Analysis of Variance | Kruskal-Wallis |
\[ \\[3cm] \]